[from_pretrained] Allow tokenizer_type ≠ model_type #6995
Conversation
sgugger
left a comment
Not sure I fully understand the use case, but nothing against the principle of it.
```python
    tokenizer_class_candidate = f"{config.tokenizer_class}Fast"
else:
    tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
```
Might be cleaner to use some of our internal list/dict containing all tokenizers, just in case there are weird things in the namespace of some users.
Yes, I was wondering about that. I also wondered whether, by using globals(), someone could even use a tokenizer that's not in the library, but I don't think so, as globals() here resolves to this module's own namespace (this scope/file), if I understand correctly.
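For illustration, here is a minimal sketch of the registry-based lookup being suggested. The registry name and the toy tokenizer classes are assumptions made up for this example, not the library's actual internals:

```python
# Sketch only: TOKENIZER_REGISTRY and the toy classes below are hypothetical
# stand-ins for whatever internal list/dict of tokenizers the library keeps.
class BertTokenizer:
    pass

class BertTokenizerFast:
    pass

# An explicit allow-list avoids picking up unrelated names from the module
# namespace the way globals().get(...) could.
TOKENIZER_REGISTRY = {
    "BertTokenizer": BertTokenizer,
    "BertTokenizerFast": BertTokenizerFast,
}

def resolve_tokenizer_class(name, use_fast=False):
    # Prefer the "...Fast" variant when requested, mirroring the diff above.
    candidate = f"{name}Fast" if use_fast and not name.endswith("Fast") else name
    tokenizer_class = TOKENIZER_REGISTRY.get(candidate)
    if tokenizer_class is None:
        raise ValueError(
            f"Tokenizer class {candidate} does not exist or is not currently imported."
        )
    return tokenizer_class

print(resolve_tokenizer_class("BertTokenizer", use_fast=True))  # -> BertTokenizerFast
```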
LysandreJik
left a comment
Cool, this looks good to me! Thanks for working on it.
```python
tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
if tokenizer_class is None:
    raise ValueError("Tokenizer class {} does not exist or is not currently imported.")
```
The content of the {} is missing? Or is there some magic in ValueError that fills it in?
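There is no such magic in ValueError; a minimal sketch of the straightforward fix, with an illustrative candidate value:

```python
# The candidate name has to be interpolated explicitly, e.g. via an f-string
# or str.format. "SomeMissingTokenizer" is an illustrative value only.
tokenizer_class_candidate = "SomeMissingTokenizer"
tokenizer_class = globals().get(tokenizer_class_candidate)
if tokenizer_class is None:
    raise ValueError(
        f"Tokenizer class {tokenizer_class_candidate} does not exist "
        "or is not currently imported."
    )
```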
The idea is to prevent a combinatorial explosion of "model types" when only the tokenizer is different (e.g. Flaubert, CamemBERT if we wanted to support them today). In the future we might even want to have a few model-agnostic tokenizer classes like ByteLevelBPETokenizer (basically RobertaTokenizer), as they can be initialized pretty exhaustively from the init args stored in `tokenizer_config.json`.

For an example usage of this PR, see the `tokenizer_class` attribute in this config.json: https://s3.amazonaws.com/models.huggingface.co/bert/julien-c/dummy-diff-tokenizer/config.json (see also the usage sketch at the end of this section).

Instead of a class, we could have used a `tokenizer_type` belonging to the set of all `model_type`s, like `"bert"`, etc., but that feels more restrictive, especially in case we start having tokenizer classes that are not obviously linked to a "model", like a potential "TweetTokenizer".

Context: #6129
Update: documented by @sgugger in #8152
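A hedged usage sketch of what this PR enables. The repo name comes from the link above; the exact config contents shown in the comment are assumptions, not verified facts:

```python
# Illustrative only: assumes the hub repo linked above still resolves and that
# its config.json sets a tokenizer_class that differs from what its model_type
# would normally imply, e.g. something like:
#   { "model_type": "...", "tokenizer_class": "BertTokenizer", ... }
from transformers import AutoTokenizer

# With this PR, from_pretrained reads config.tokenizer_class and instantiates
# that class instead of the one the model_type would normally map to.
tokenizer = AutoTokenizer.from_pretrained("julien-c/dummy-diff-tokenizer")
print(type(tokenizer).__name__)
```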